stepwise selection
How do you know what independent variables to include in a regression model?
After a few biostatistics classes, I began fitting my first logistic regression model using my physician friend's data on tumors excised from skin cancer patients. I realized that although we were very clear about the dependent variable we were trying to predict – a certain feature of the tumor – I really did not know how to pick the independent variables that belonged in the model. We only had a few to choose from in our dataset, and I put them all into the model, but I wasn't really sure what to do next. Remove the ones whose slope had a p-value above 0.05? I asked one of my professors what to do, and in her own idiosyncratic way, she seemed to describe what I will call "stepwise selection".
Machine Learning: An In-Depth Guide - Model Evaluation, Validation, Complexity, and Improvement
Welcome to the third article in a five-part series about machine learning. In this article, we'll continue our machine learning discussion and focus on the problems associated with overfitting, along with controlling model complexity, an introduction to model evaluation and errors, model validation and tuning, and improving model performance. Overfitting is one of the greatest concerns in predictive analytics and machine learning. It refers to a situation where the chosen model fits the training data too closely and essentially captures the noise, outliers, and other quirks of that particular sample along with the underlying pattern. The consequence is that the model fits the training data very well but does not accurately predict cases not represented by the training data, and therefore does not generalize well to unseen data.
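To make the idea concrete, here is a minimal sketch of overfitting using only the Python standard library. Everything in it is an assumption made for illustration rather than something taken from the series: the synthetic data (a straight line plus Gaussian noise), the two toy models, and helper names such as `fit_line`, `fit_memorizer`, and `make_data`. The "memorizer" is a 1-nearest-neighbour predictor that simply returns the training value closest to the input, so it reproduces the training set exactly, noise included.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for a straight line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_memorizer(xs, ys):
    """1-nearest-neighbour: predict the y of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_data(n, rng):
    """Hypothetical data: y = 2x plus Gaussian noise with standard deviation 1."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0, 1.0) for x in xs]
    return xs, ys


rng = random.Random(0)  # fixed seed so the run is repeatable
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(200, rng)

line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

# (training error, test error) for each model
results = {
    "line": (mse(line, train_x, train_y), mse(line, test_x, test_y)),
    "memorizer": (mse(memorizer, train_x, train_y), mse(memorizer, test_x, test_y)),
}

for name, (train_err, test_err) in results.items():
    print(f"{name:10s} train MSE = {train_err:.3f}   test MSE = {test_err:.3f}")
```

The memorizer's training error is zero because it has stored the noise in every training point, yet its error on fresh data is clearly worse than that. The gap between training and test error is the symptom to watch for. The simple line cannot fit the noise, so its training and test errors should stay close to each other.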